Abstract

The retinal image of surrounding objects varies tremendously due to changes in
position, size, pose, illumination condition, background context, occlusion,
noise, and nonrigid deformations. Despite these huge variations, our visual
system is able to invariantly recognize any object in just a fraction of a
second. To date, various computational models have been proposed to mimic the
hierarchical processing of the ventral visual pathway, with limited success.
Here, we show that combining a biologically inspired network architecture with
a biologically inspired learning rule significantly improves the models'
performance when facing challenging invariant object recognition problems. Our
model is an asynchronous feedforward spiking neural network. When the network
is presented with natural images, the neurons in the entry layers detect edges,
and the most activated ones fire first, while neurons in higher layers are
equipped with spike timing-dependent plasticity. These neurons progressively
become selective to intermediate complexity visual features appropriate for
object categorization. The model is evaluated on the 3D-Object and ETH-80
datasets, which are two benchmarks for invariant object recognition, and is
shown to outperform state-of-the-art models, including DeepConvNet and HMAX.
This demonstrates its ability to accurately recognize different instances of
multiple object classes even under various appearance conditions (different
views, scales, tilts, and backgrounds). Several statistical analysis techniques
are used to show that our model extracts class specific and highly informative
features.

Keywords: View-Invariant Object Recognition, 
Visual Cortex, STDP, Spiking Neurons, Temporal 
Coding 

1 Introduction 

Humans can effortlessly and rapidly recognize surrounding objects [T], despite
the tremendous variations in the projection of each object on the retina [2]
caused by various transformations such as changes in object position, size,
pose, illumination condition and background context [3]. This invariant
recognition is presumably handled through hierarchical processing in the
so-called ventral pathway. Such hierarchical processing starts in V1 layers,
which extract simple features such as bars and edges in different
orientations [1], continues in intermediate layers such as V2 and V4, which are
responsive to more complex features [S], and culminates in the inferior
temporal cortex (IT), where the neurons are selective to object parts or whole
objects [0]. By moving from the lower layers to the higher layers, the feature
complexity, receptive field size and transformation invariance increase, in
such a way that the IT neurons can invariantly represent the objects in a
linearly separable manner [ZlIH].

Another amazing feature of the primates' visual system is its high processing
speed. The first wave of image-driven neuronal responses in IT appears around
100 ms after the stimulus onset pa. Recordings from monkey IT cortex have
demonstrated that the first spikes (over a short time window of 12.5 ms), about
100 ms after the image presentation, carry accurate information about the
nature of the visual stimulus [7]. Hence, ultra-rapid object recognition is
presumably performed in a feedforward manner [3]. Moreover, although there
exist various intra- and inter-area feedback connections in the visual cortex,
some neurophysiological piioiia] and theoretical [TT] studies have also
suggested that the feedforward information is usually sufficient for invariant
object categorization.

Attracted by the impressive speed and performance of the primates' visual
system, computer vision scientists have long tried to "copy" it. So far, it is
mostly the architecture of the visual system that has been mimicked. For
instance, using hierarchical feedforward networks with restricted receptive
fields, like in the brain, has proven useful [HI [la na [El ng [H!-. In
comparison, the way that biological visual systems learn the appropriate
features has attracted much less attention. All the above-mentioned approaches
use learning rules that are not biologically plausible. Yet the ability of the
visual cortex to wire itself, mostly in an unsupervised manner, is
remarkable [THl [E].

Here, we propose that adding bio-inspired learning to bio-inspired
architectures could improve the models' behavior. To this end, we focused on a
particular form of synaptic plasticity known as spike timing-dependent
plasticity (STDP), which has been observed in the mammalian visual
cortex [20, 21]. Briefly, STDP reinforces the connections with afferents that
significantly contributed to making a neuron fire, while it depresses the
others [22]. A recent psychophysical study provided some indirect evidence for
this form of plasticity in the human visual cortex [23].

In an earlier study [21], it was shown that a combination of a temporal coding
scheme (where, in the entry layer of a spiking neural network, the most
strongly activated neurons fire first) with STDP leads to a situation where
neurons in higher visual areas gradually become selective to complex visual
features in an unsupervised manner. These features are both salient and
consistently present in the inputs. Furthermore, as learning progresses, the
neurons' responses rapidly accelerate. These responses can then be fed to a
classifier to perform a categorization task.

In this study, we show that such an approach strongly outperforms
state-of-the-art computer vision algorithms on view-invariant object
recognition benchmark tasks including the 3D-Object [25, 26] and ETH-80 [2Z|
datasets. These datasets contain natural and unsegmented images, where objects
have large variations in scale, viewpoint, and tilt, which makes their
recognition hard [2^, and probably out of reach for most of the other
bio-inspired models [29, 30]. Yet our algorithm generalizes surprisingly well,
even when "simple classifiers" are used, because STDP naturally extracts
features that are class specific. This point was further confirmed using mutual
information jSl] and the representational dissimilarity matrix (RDM) [32].
Moreover, the distribution of objects in the obtained feature space was
analyzed using hierarchical clustering |3H|, and objects of the same category
tended to cluster together.

2 Materials and methods 

The algorithm we used here is a scaled-up version of the one presented in ED.
Essentially, many more C2 features and iterations were used. Our code is
available upon request. We used a five-layer hierarchical network
S1 → C1 → S2 → C2 → classifier, largely inspired by the HMAX model ra (see
Fig. 1). Specifically, we alternated simple cells that gain selectivity through
a sum operation, and complex cells that gain shift and scale invariance through
a max operation. However, our network uses spiking neurons and operates in the
temporal domain: when presented with an image, the first layer's cells detect
oriented edges, and the more strongly a cell is stimulated the earlier it
fires. These spikes are then propagated asynchronously through the feedforward
network. We only compute the first spike fired by each neuron (if any), which
leads to efficient implementations. The justification for this is that later
spikes are probably not used in ultra-rapid visual categorization tasks in
primates [31]. We used restricted receptive fields and a weight sharing
mechanism (i.e. a convolutional network). In our model, images are presented
sequentially and the resulting spike waves are propagated through to the S2
layer, where STDP is used to extract diagnostic features.

Figure 1: Overview of our 5 layered feedforward spiking neural network. The
network processes the input image in a multi-scale form; each processing scale
is shown with a different color. Cells are organized in retinotopic maps until
the S2 layer (included). S1 cells of each processing scale detect edges from
the corresponding scaled image. C1 maps sub-sample the corresponding S1 maps by
taking the maximum response over a square neighborhood. S2 cells are selective
to intermediate complexity visual features, defined as a combination of
oriented edges of a same scale (here we symbolically represented a triangle
detector and a square detector). There is one S1-C1-S2 pathway for each
processing scale. Then C2 cells take the maximum response of S2 cells over all
positions and scales and are thus shift and scale invariant. Finally, a
classification is done based on the C2 cells' responses (here we symbolically
represented a house/non-house classifier). C1 to S2 synaptic connections are
learned with STDP, in an unsupervised manner.

More specifically, the first layer's S1 cells detect bars and edges using Gabor
filters. Here we used 5 x 5 convolutional kernels corresponding to Gabor
filters with a wavelength of 5 and four different preferred orientations
(π/8, π/4 + π/8, π/2 + π/8, 3π/4 + π/8). These filters are applied to five
scaled versions of the original image: 100%, 71%, 50%, 30%, and 25% (each
processing scale is denoted by a different color in Fig. 1). Hence, for each
scaled version of the input image we have four S1 maps (one for each
orientation), and overall, there are 4 x 5 = 20 maps of S1 cells (see the S1
maps of Fig. 1). Evidently, the S1 cells of larger scales detect edges with
higher spatial frequencies while the smaller scales extract edges with lower
spatial frequencies. Indeed, instead of changing the size and spatial frequency
of the Gabor filters, we are changing the size of the input image. This is a
way to implement scale invariance at a low computational cost.
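
The S1 filter bank described above can be sketched as follows. This is a minimal illustration, not the authors' code: the kernel size, wavelength, orientations, and scale factors are taken from the text, whereas the Gaussian width `sigma`, the aspect ratio `gamma`, the mean subtraction, and all function names are assumptions introduced here.

```python
import math

def gabor_kernel(theta, size=5, wavelength=5.0, sigma=2.0, gamma=0.5):
    """Return a size x size Gabor kernel (list of rows) for orientation theta.

    sigma and gamma are assumed values; the text only fixes size and wavelength.
    """
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            # Rotate coordinates into the filter's preferred orientation.
            xr = x * math.cos(theta) + y * math.sin(theta)
            yr = -x * math.sin(theta) + y * math.cos(theta)
            envelope = math.exp(-(xr ** 2 + (gamma * yr) ** 2) / (2 * sigma ** 2))
            row.append(envelope * math.cos(2 * math.pi * xr / wavelength))
        kernel.append(row)
    # Assumption: remove the mean so uniform image regions give zero response.
    mean = sum(map(sum, kernel)) / size ** 2
    return [[v - mean for v in row] for row in kernel]

# Four preferred orientations: pi/8, pi/4 + pi/8, pi/2 + pi/8, 3*pi/4 + pi/8.
ORIENTATIONS = [math.pi / 8 + k * math.pi / 4 for k in range(4)]
# Five processing scales, as listed in the text.
SCALES = [1.00, 0.71, 0.50, 0.30, 0.25]
KERNELS = [gabor_kernel(theta) for theta in ORIENTATIONS]

def scaled_sizes(height, width):
    """Image sizes at each processing scale (the image is resized, not the filter)."""
    return [(round(height * s), round(width * s)) for s in SCALES]
```

Convolving each of the five resized images with the four kernels would give the 4 x 5 = 20 S1 maps mentioned in the text.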

Each S1 cell emits a spike with a latency that is inversely proportional to the
absolute value of the convolution. Thus, the more strongly a cell is stimulated
the earlier it fires (intensity-to-latency conversion, as observed
experimentally [HSl ESI EZ!]). To increase the sparsity at a given scale and
location (corresponding to one cortical column), only the spike corresponding
to the best matching orientation is propagated (i.e. a winner-take-all
inhibition is employed). In other words, for each position in the four S1
orientation maps of a given scale, the cell with the highest convolution value
emits a spike and prevents the other three S1 cells from firing.
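
The intensity-to-latency conversion and the per-position winner-take-all can be sketched as below. The function name and the data layout (one list-of-rows map per orientation) are assumptions made for illustration; the latency is taken as exactly 1/|response|, which is one simple way to realize "inversely proportional".

```python
def s1_spikes(responses):
    """Convert S1 convolution values at one scale into first-spike latencies.

    responses: list of 4 maps (one per orientation), each a list of rows.
    Returns {(row, col): (winning_orientation, latency)}; positions where
    every response is zero emit no spike.
    """
    rows, cols = len(responses[0]), len(responses[0][0])
    spikes = {}
    for r in range(rows):
        for c in range(cols):
            strengths = [abs(m[r][c]) for m in responses]
            # Winner-take-all: only the best matching orientation fires.
            best = max(range(len(strengths)), key=strengths.__getitem__)
            if strengths[best] > 0:
                # Stronger stimulation gives an earlier spike.
                spikes[(r, c)] = (best, 1.0 / strengths[best])
    return spikes
```

For example, with responses 0.5, -2.0, 1.0, and 0.1 at one position, only the second orientation fires, with latency 0.5.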

For each S1 map, there is a corresponding C1 map. Each C1 cell propagates the
first spike emitted by the S1 cells in a 7 x 7 square neighborhood of the S1
map which corresponds to one specific orientation and one scale (see the C1
maps of Fig. 1). C1 cells thus execute a maximum operation over the S1 cells
with the same preferred feature across a portion of the visual field, which is
a biologically plausible way to gain local shift invariance [3HI EH]. The
overlap between the afferents of two adjacent C1 cells is just one S1 row,
hence a subsampling over the S1 maps is done by the C1 layers as well.
Therefore, each C1 map has about 6 x 6 = 36 times fewer cells than the
corresponding S1 map.
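
In the latency domain, the C1 maximum operation amounts to taking the earliest spike in each window. The sketch below assumes a 7 x 7 window moved with a stride of 6, so that adjacent windows overlap by one row as stated above; `c1_pool` and the use of infinity for "no spike" are illustrative choices, not taken from the paper.

```python
import math

def c1_pool(latencies, window=7, stride=6):
    """Propagate the earliest S1 spike in each window x window neighborhood.

    latencies: list of rows of S1 spike latencies for one orientation and
    scale, with math.inf where no spike occurred.
    """
    rows, cols = len(latencies), len(latencies[0])
    pooled = []
    for r0 in range(0, rows - window + 1, stride):
        pooled_row = []
        for c0 in range(0, cols - window + 1, stride):
            # Earliest spike = strongest response (max operation).
            pooled_row.append(min(
                latencies[r][c]
                for r in range(r0, r0 + window)
                for c in range(c0, c0 + window)
            ))
        pooled.append(pooled_row)
    return pooled
```

With a stride of 6 in each direction, the pooled map has roughly 36 times fewer cells than its input, in line with the text.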

S2 features correspond to intermediate-complexity visual features which are
optimum for object classification [10]. Each S2 feature has a prototype S2 cell
(specified by a C1-S2 synaptic weight matrix), which is a weighted combination
of bars (C1 cells) with different orientations in a 16 x 16 square
neighborhood. Each prototype S2 cell is retinotopically duplicated in the five
scale maps (i.e. weight-sharing is used). Within those maps, the S2 cells can
integrate spikes only from the four C1 maps of their corresponding processing
scales. This way, a given S2 feature is simultaneously explored in all
positions and scales (see the S2 maps of Fig. 1, with the same feature
prototype but in different processing scales specified by different colors).
Indeed, duplicated cells in all positions of all scale maps integrate the spike
train in parallel and compete with each other. The first duplicate reaching its
threshold, if any, is the winner. The winner fires and prevents the other
duplicated cells in all other positions and scales from firing through a
winner-take-all inhibition mechanism. Then, for each prototype, the winner S2
cell triggers the unsupervised STDP rule and its weight matrix is updated. The
changes in its weights are applied over all other duplicate cells in different
positions and scales (weight sharing mechanism). This allows the system to
learn frequent patterns, independently of their position and size in the
training images.

The learning process begins with S2 features initialized by random numbers
drawn from a normal distribution with mean 0.8 and STD 0.05, and the threshold
of all S2 cells is set to 64 (= 1/4 x 16 x 16). Through the learning process, a
local inhibition between different S2 prototype cells is used to prevent the
convergence of different S2 prototypes to similar features: when a cell fires
at a given position and scale, it prevents all the other cells (independently
of their preferred prototype) from firing later at the same scale and within a
neighborhood around the firing position. Thus, the cell population
self-organizes, each cell trying to learn a distinct pattern so as to cover the
whole variability of the inputs. Moreover, we applied a k-winner-take-all
strategy in the S2 layer to ensure that at most two cells can fire for each
processing scale. This mechanism, only used in the learning phase, helps the
cells to learn patterns with different real sizes. Without it, there is a
natural bias toward "small" patterns (i.e., large scales), simply because the
corresponding maps are larger, and so the likelihood of firing with random
weights at the beginning of the STDP process is higher.
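
How a single S2 cell reaches its threshold can be sketched as follows. This is a simplified, non-leaky integrate-and-fire illustration under assumptions made here: afferent spikes are given as (time, afferent index) pairs, each spike adds its synaptic weight to the potential, and the cell fires at the first spike that brings the potential to the threshold. The function name and data layout are hypothetical.

```python
def s2_first_spike(input_spikes, weights, threshold=0.25 * 16 * 16):
    """Return the firing time of an S2 cell, or None if it never fires.

    input_spikes: iterable of (time, afferent_index) for the C1 afferents in
    the cell's 16 x 16 neighborhood; weights: synaptic weight per afferent.
    The default threshold is 64, as in the text.
    """
    potential = 0.0
    # Process afferent spikes in temporal order (earliest first).
    for t, idx in sorted(input_spikes):
        potential += weights[idx]
        if potential >= threshold:
            return t
    return None
```

Among all duplicates of one prototype, the cell with the smallest returned time would be the winner that triggers the STDP update.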

A simplified version of STDP is used to learn the C1-S2 weights as follows:

$$
\Delta w_{ij} =
\begin{cases}
a^{+}\, w_{ij}\,(1 - w_{ij}), & \text{if } t_j - t_i \le 0,\\
a^{-}\, w_{ij}\,(1 - w_{ij}), & \text{if } t_j - t_i > 0,
\end{cases}
$$

where $i$ and $j$ respectively refer to the index of post- and presynaptic
neurons, $t_i$ and $t_j$ are the corresponding spike times, $\Delta w_{ij}$ is
the synaptic weight modification, and $a^{+}$ and $a^{-}$ are two parameters
specifying the learning rate. Note that the exact time difference between two
spikes ($t_j - t_i$) does not affect the weight change; only its sign is
considered. These simplifications are equivalent to assuming that the
intensity-to-latency conversion of S1 cells compresses the whole spike wave
into a relatively short time interval (say, 20-30 ms), so that all presynaptic
spikes necessarily fall close to the postsynaptic spike time, and the time lags
are negligible. The multiplicative term $w_{ij}(1 - w_{ij})$ ensures the
weights remain in the range [0,1] and maintains all synapses in an excitatory
mode. The learning phase starts with a small value of $a^{+}$, which is
multiplied by 2 after every 400 postsynaptic spikes up to a fixed maximum
value. A fixed $a^{+}/a^{-}$ ratio (-4/3) is used. This allows us to speed up
the convergence of S2 features as the learning progresses. Starting the
learning phase with high learning rates would lead to erratic results.
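
The simplified STDP rule and the doubling schedule of the learning rate can be sketched as below. The numerical values of the initial and maximum learning rate are passed in as parameters because they are not assumed here; treating afferents that never spiked like late afferents (i.e. depressing them) is also an assumption of this sketch, as are the function names.

```python
def stdp_update(weights, pre_times, post_time, a_plus, ratio=-4.0 / 3.0):
    """Apply one simplified STDP update to the winner's C1-S2 weights.

    pre_times[k] is the spike time of afferent k (None if it did not spike).
    Only the sign of (t_pre - t_post) matters, not its magnitude.
    """
    a_minus = a_plus / ratio  # fixed a+/a- ratio of -4/3
    updated = []
    for w, t_pre in zip(weights, pre_times):
        potentiate = t_pre is not None and t_pre <= post_time
        rate = a_plus if potentiate else a_minus
        # The factor w * (1 - w) keeps the weight inside [0, 1].
        updated.append(w + rate * w * (1.0 - w))
    return updated

def a_plus_schedule(n_post_spikes, a_initial, a_max, period=400):
    """Double a+ after every `period` postsynaptic spikes, capped at a_max."""
    return min(a_initial * 2 ** (n_post_spikes // period), a_max)
```

For instance, with `a_plus = 0.04` (an arbitrary illustrative value), a weight of 0.5 moves to 0.51 for an afferent that fired before the postsynaptic spike and to 0.4925 for one that fired after it.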

For each S2 prototype, a C2 cell propagates the first spike emitted by the
corresponding S2 cells over all positions and processing scales, leading to
global shift- and scale-invariant cells (see the C2 layer of Fig. 1).

3 Experimental Results 

3.1 Dataset and Experimental Setup

To study the robustness of our model with respect to different transformations
such as scale and viewpoint, we evaluated it on the 3D-Object and ETH-80
datasets. The 3D-Object dataset is provided by Savarese et al. at CVGLab,
Stanford University [23]. This dataset contains 10 different object classes:
bicycle, car, cellphone, head, iron, monitor, mouse, shoe, stapler, and
toaster. There are about 10 different instances for each object class. The
object instances are photographed in about 72 different conditions: eight view
angles, three distances (scales), and three different tilts. The images are not
segmented and the objects are located in different backgrounds (the background
changes even for different conditions of the same object instance). Figure 2
presents some examples of objects in this dataset.

The ETH-80 dataset includes 80 3D objects in eight different object categories:
apple, car, toy cow, cup, toy dog, toy horse, pear, and tomato. Each object is
photographed in 41 viewpoints with different view angles and different tilts.
Figure S1 in Supplementary Information provides some examples of objects in
this dataset from different viewpoints.

For both datasets, five instances of each object category are selected for the
training set to be used in the learning phase. The remaining instances
constitute the testing set, which is not seen during the learning phase but is
used afterward to evaluate the recognition performance. This standard
cross-validation procedure allows us to measure the generalization ability of
the model beyond the specific training examples. Note that for the 3D-Object
dataset, the original size of all images was preserved, while the images of the
ETH-80 dataset were resized to 300 pixels in height while preserving the aspect
ratio. The images of both datasets were converted to grayscale values.

As already mentioned, the building process of S2 features is performed in a
completely unsupervised manner. Hence, through the execution of the
unsupervised STDP-based learning, the training images are randomly fed into the
model (without considering their class labels, viewpoints, scales, and tilts).
The learning process starts with initial random weights and finishes when 600
spikes have occurred in each S2 map. Then STDP is turned off, and the ability
of the obtained features to invariantly represent different object classes is
evaluated. To compute the corresponding C2 feature vector for each input image,
the thresholds of C2 neurons are set to infinity, and their final potentials
are evaluated after propagating the whole spike train generated by each image.
Each final potential can be seen as the number of early spikes in common
between the current input and a stored prototype (this is very similar to the
tuning operation of S cells in HMAX). Then, a one-versus-one multiclass linear
support vector machine (SVM) classifier is trained based on the C2 features of
the training set and it is evaluated on the test set.

We have compared the performance of our model with the HMAX model m and the
deep supervised convolutional network (DeepConvNet) by Krizhevsky et al. [H].
Comparison with the HMAX model is particularly instructive since, as explained
above, we use a very similar architecture, with tuning and maximum operations.
The main difference is that instead of using an unsupervised learning rule like
us, the HMAX model uses random crops from the training images to imprint the S2
features (here of equal size). Then an SVM classifier was trained over the HMAX
C2 features to complete the object recognition process. The employed HMAX model
is implemented by Mutch et al. [H] and the code is publicly available at
http://cbcl.mit.edu/jmutch/cns/index.html.

Figure 2: Some images of the head class of 3D-Object dataset in different
A) views (View 1 to View 8), B) scales (Scale 1 to Scale 3), and C) tilts
(Tilt 1 to Tilt 3).

Table 1: Performance of our model, HMAX, and DeepConvNet with different numbers
of features.

Dataset    Model        # Features  Accuracy
3D-Object  Our model    200         76.1%
3D-Object  Our model    300         94.7%
3D-Object  Our model    400         96.0%
3D-Object  Our model    500         96.0%
3D-Object  HMAX         1000        58.2%
3D-Object  HMAX         3000        60.1%
3D-Object  HMAX         9000        61.9%
3D-Object  HMAX         12000       62.4%
3D-Object  DeepConvNet  4096        85.8%
ETH-80     Our model    500         75.3%
ETH-80     Our model    750         79.3%
ETH-80     Our model    1000        80.7%
ETH-80     Our model    1250        81.1%
ETH-80     HMAX         500         66.3%
ETH-80     HMAX         1000        68.7%
ETH-80     HMAX         2000        68.9%
ETH-80     HMAX         5000        69.0%
ETH-80     DeepConvNet  4096        79.1%

We also compared our model with DeepConvNet, which has been shown to be the
best algorithm in various object classification tasks including the ImageNet
LSVRC-2010 contest [lEj. It comprises eight consecutive layers (five
convolutional layers followed by three fully connected layers) with about 60
million parameters, learned with stochastic gradient descent. We have used a
pre-trained DeepConvNet model implemented by Jia et al. [12], whose code is
also available at http://caffe.berkeleyvision.org. The training was done over
the ILSVRC2012 dataset (a subset of ImageNet) with about 1.2 million images in
1000 categories. We fed the training and testing images into DeepConvNet and
extracted the feature values from the 7th layer. Again, an SVM is used to do
the object recognition based on the extracted features.


3.2 Performance analysis 

Table 1 provides the accuracy of our model in category classification
independently of view, tilt, and scale, when different numbers of S2 features
are learned by the STDP-based learning algorithm. The results indicate that the
model reaches a high classification performance on the 3D-Object dataset with
only about 300 C2 features (about 30 features per class). The performance
plateaus around 96% for feature vectors of size 400 or greater. Also, for the
ETH-80 dataset, the model attains a reasonable recognition accuracy of about
81% with only 1250 extracted features. We have also performed the same
experiments on the HMAX and DeepConvNet models, whose accuracies are also
provided in Table 1.

Performance of the HMAX model was weak on both datasets, which is not too
surprising: several previous studies have shown that the performance of the
HMAX model decreases substantially when facing significant object
transformations p8l [29]. Given the structural similarities between our model
and HMAX, the superiority of our model is presumably related to the
unsupervised feature learning. Indeed, most of the randomly extracted S2
patches in HMAX tend to be redundant and irrelevant, as we will see in the next
section.

DeepConvNet reached a mean performance of about 86% on the 3D-Object dataset
and about 79% on the ETH-80 dataset. Thus, our model outperforms DeepConvNet on
both datasets, which itself significantly outperforms HMAX. It should be noted
that the images of each object in these two datasets are highly varied (e.g.,
in the 3D-Object dataset, there is a 45° difference between two successive
views of an object) and it has previously been shown that the performance of
DeepConvNet drops when facing such transformations [3^. Another drawback of
DeepConvNet is that, due to the large number of parameters, it needs to be
trained over millions of images to avoid overfitting [IH] (here we avoided this
problem by using a pre-trained version, but doing the training on the roughly
3500 images we used here would presumably lead to massive overfitting).
Conversely, our model is able to learn objects using far fewer images.

Consequently, the results indicate that our model has a great ability to learn
diagnostic features tolerating transformations and deformations of the
presented stimulus.

3.3 Feature Analysis 

In this section, we demonstrate that the unsupervised STDP learning algorithm
extracts informative and diagnostic features by comparing them to the randomly
picked HMAX features. To this end, we have used several feature analysis
techniques: representational dissimilarity matrices, hierarchical clustering,
and mutual information. We performed the same analyses on both datasets and
obtained similar results. Hence, the results of ETH-80 are presented in
Supplementary Information.

Extraction of diagnostic features lets our model reach high classification
performance with a small number of features (cf. Table 1). To understand why
this is true, we first reconstructed the features' preferred stimuli. Given
that each S2 neuron receives spikes from C1 neurons responding to bars in
different orientations, the representation of the preferred features of S2
neurons could be reconstructed by convolving their weight matrices with a set
of kernels representing oriented bars. In Fig. 3, the receptive fields of
activated S2 neurons along with the representation of their preferred stimuli
are illustrated (Fig. S2 provides the same illustration for the ETH-80
dataset). This demonstrates that only a small number of S2 neurons are required
to represent the input objects. In other words, the obtained features are
compatible with the sparse coding theory in visual cortex. In addition, for an
input image, the most activated S2 neurons cover the input objects and they do
not respond to the background area. Indeed, the STDP learning algorithm
naturally focuses on what is common in the training images, which are the
target object features. The backgrounds are generally not learned (at least not
in priority), since they are almost always too different from one image to
another and the STDP process cannot converge on them.

To characterize the neuronal population coding in the C2 layer of the model and
to study the quality of C2 features, we used the representational dissimilarity
matrix (RDM) [H2]. Each element of the RDM reflects the measure of
dissimilarity (distance) among the neural activity patterns (i.e., the object
representations) associated with two different image stimuli. The distance we
used here is 1 - Pearson correlation. In an RDM corresponding to a perfect
model, the representations of the objects of the same category have low
dissimilarities (i.e., highly correlated), whereas objects of different
categories are represented highly dissimilarly (i.e., uncorrelated). Hence, if
we group the rows and columns of the RDM of a perfect model based on object
categories, it is expected to see squares of low dissimilarity values around
the main diagonal, each of which corresponds to pairs of same-category images,
while other elements have higher values.
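
Computing an RDM with the 1 - Pearson correlation distance can be sketched as below. The function names are illustrative, and the sketch assumes that no feature vector is constant (otherwise the correlation is undefined and the division fails).

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def rdm(features):
    """Representational dissimilarity matrix: entry (i, j) is 1 - r(i, j).

    features: list of C2 feature vectors, one per image, ideally ordered so
    that images of the same category are adjacent.
    """
    n = len(features)
    return [[1.0 - pearson(features[i], features[j]) for j in range(n)]
            for i in range(n)]
```

Perfectly correlated feature vectors give a dissimilarity of 0, uncorrelated ones give 1, and anti-correlated ones give 2.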

Here, to plot the RDM of each view angle, first, the images of all input
instances which are taken in that view are picked. Then the corresponding RDM
is plotted by computing the pairwise dissimilarity of the values of C2 features
associated with each pair of images. Figure 4 presents the RDMs of our model
for all eight views (see Fig. S3 for ETH-80). In each RDM, rows and columns are
sorted based on image categories. Also, a sample image of each category is
placed next to the rows and columns which correspond to that category. Here, we
used a color-code to represent RDMs which ranges from pure blue to pure yellow
demonstrating low to high dissimilarities, respectively. It can be seen that
the within-category dissimilarity values (identified by blue squares around the
main diagonal) are relatively lower than the between-category dissimilarities
(more yellowish areas). As expected, the RDMs indicate that the obtained
performance is not due to the capabilities of the classifier, but to the
extraction of diagnostic and highly informative C2 features through STDP.

Figure 3: Three S2 feature prototypes selective to the a) bicycle, b) face, and
c) cellphone classes of the 3D-Object dataset along with their reconstructed
preferred stimuli. It can be seen that the features converged to specific and
salient object parts and neglected the irrelevant backgrounds.

Figure 4: RDMs of our model on 3D-Object dataset corresponding to different
viewpoints (panels (a) to (h): View 1 to View 8; color scale: dissimilarity
measure). It can be seen that within class dissimilarities are very low (the
blue squares around the main diagonal where rows and columns correspond to
images of the same category), while between class dissimilarities are higher
(more yellowish). Note that due to the absence of image samples for some views
of the monitor class, we have eliminated this class from the RDMs.

Figure 5: RDMs of the HMAX model on 3D-Object dataset corresponding to
different viewpoints (panels (a) to (h): View 1 to View 8; color scale:
dissimilarity measure). Randomly selected features in HMAX model are not able
to similarly represent within-category objects and dissimilarly represent
between-category objects. Note that due to the absence of image samples for
some views of the monitor class, we have eliminated this class from the RDMs.

We have also computed the RDMs of the HMAX model (including 12000 features) for
eight views, as provided in Fig. 5 (see Fig. S4 for ETH-80). As can be seen in
this figure, the randomly selected features in the HMAX model are unable to
similarly represent within-category objects and dissimilarly represent
between-category objects. This is probably due to uninformative features used
by the HMAX model. Indeed, in HMAX, the task of selecting the informative
features is left to the classifier. We also note the presence of horizontal and
vertical yellow lines, indicating "outliers", whose representation lies far
away from all the others. This indicates that the features do not tile the
stimulus space well.

To see how well the stimuli are distributed in the high-dimensional
feature space, we performed hierarchical clustering over the test set.
The clustering procedure starts by treating each stimulus as a separate
cluster node, continues by repeatedly connecting the closest nodes into
a new combined cluster node, and ends when all the stimuli are connected
to a single node. We performed this analysis on the C2 feature vectors
corresponding to all objects in all views, scales, and tilts. The
obtained hierarchy for our model is displayed in Fig. 6 (see Fig. S5 for
ETH-80). The distance between a pair of cluster nodes is computed by
measuring the dissimilarity between their centers (the average of the
cluster members). Due to the large number of stimuli, it is not possible
to plot the whole hierarchy; hence, only the high-level clusters are
shown in this figure. For each lowest-level cluster, the class with the
highest frequency is illustrated by an image label. The cardinality of
this class as well as the cardinality of the cluster are shown below the
labels. It can be seen that the instances of each object class are
placed in neighboring regions of the feature space. Considering the
obtained hierarchical clustering together with the classification
accuracies, it can be concluded that the C2 features represent the
objects invariantly, in such a way that the classifier can easily
separate them.

Figure 6: The hierarchy of clusters and their labels for our model on the 3D-Object dataset. The label of each cluster indicates the class with the highest frequency in that cluster. It can be seen that the samples of each class are placed in close clusters. The cardinality of each cluster, C, and the cardinality of the class with the highest frequency, H, are placed below the cluster label as H/C.

Figure 7: The hierarchy of clusters and their labels for the HMAX model on the 3D-Object dataset. The label of each cluster indicates the class with the highest frequency in that cluster. It can be seen that the majority of the objects are assigned to a small number of clusters and samples of each class are not well placed in close clusters. The cardinality of each cluster, C, and the cardinality of the class with the highest frequency, H, are placed below the cluster label as H/C.
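
The agglomerative procedure described above (start from single-stimulus
nodes, repeatedly merge the two nodes whose centers are closest, stop at
a single node) can be sketched as follows. This is only an illustration
under stated assumptions: the Euclidean dissimilarity, the function
names, and the toy vectors are ours and are not taken from the model's
implementation.

```python
from itertools import combinations


def dissimilarity(a, b):
    """Euclidean distance between two centers (an assumed measure)."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def centroid(vectors):
    """Component-wise average of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def agglomerate(vectors):
    """Merge the two closest clusters until a single node remains.

    Returns the merge history as (members_a, members_b, distance) tuples,
    where members are tuples of indices into `vectors`.
    """
    clusters = [(i,) for i in range(len(vectors))]
    history = []
    while len(clusters) > 1:
        best = None
        for a, b in combinations(clusters, 2):
            ca = centroid([vectors[i] for i in a])
            cb = centroid([vectors[i] for i in b])
            d = dissimilarity(ca, cb)
            if best is None or d < best[2]:
                best = (a, b, d)
        a, b, d = best
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(tuple(sorted(a + b)))
        history.append((a, b, d))
    return history


# Toy stand-ins for C2 vectors: two well-separated groups.
toy = [[0.0, 0.0], [1.0, 0.0], [10.0, 10.0], [10.0, 12.0]]
merges = agglomerate(toy)
for a, b, d in merges:
    print(a, b, round(d, 2))
```

On the toy data the two nearby pairs are merged first and the two
resulting groups are joined last, which is the pattern one would hope to
see for well-separated classes.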

The same hierarchical clustering is performed for the HMAX feature space
(with 12000 features), as depicted in Fig. 7 (see Fig. S6 for ETH-80).
As can be seen, the majority of clusters are small and, contrary to our
model, the distances between the clusters are very low. In other words,
the objects are densely represented in a small area of such a
high-dimensional feature space. Furthermore, the mean intra- and
inter-class dissimilarities in our model are equal to 0.40 and 0.70,
respectively, while these statistics for the HMAX model are equal to
0.27 and 0.29, respectively. In summary, it can be concluded that the
object classes are densely distributed and highly overlapping in the
HMAX feature space, while they are well separated in the feature space
of our model.
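
The two summary statistics quoted above can be computed along the lines
of the sketch below. It assumes that dissimilarity is one minus the
Pearson correlation between two feature vectors, which is a common
choice for representational dissimilarity but is an assumption here; the
function names and toy vectors are also illustrative.

```python
from itertools import combinations
from statistics import mean


def pearson_dissimilarity(a, b):
    """1 - Pearson correlation between two feature vectors (assumed measure)."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return 1.0 - cov / (var_a * var_b) ** 0.5


def intra_inter(vectors, labels):
    """Return (mean intra-class, mean inter-class) dissimilarity over all pairs."""
    intra, inter = [], []
    for i, j in combinations(range(len(vectors)), 2):
        d = pearson_dissimilarity(vectors[i], vectors[j])
        (intra if labels[i] == labels[j] else inter).append(d)
    return mean(intra), mean(inter)


# Toy data: two classes with opposite activation profiles.
vectors = [[1, 2, 3, 4], [2, 3, 4, 6], [4, 3, 2, 1], [5, 3, 2, 0]]
labels = ["car", "car", "shoe", "shoe"]
intra, inter = intra_inter(vectors, labels)
print(round(intra, 3), round(inter, 3))
```

A large gap between the two values indicates well-separated classes; two
nearly equal values indicate overlapping classes.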

In another experiment, we analyzed the class dependency of the C2
features for our model. To this end, the 50 most informative features,
when classifying a specific class against all the other classes, are
selected by employing the mutual information technique. In other words,
for each class, we selected those 50 features which have the highest
activity for samples of that class and lower activity for other classes.
Afterwards, the number of common features among the informative features
of each pair of classes is computed, as provided in Table 2. On average,
there are only about 5.4 common features between pairs of classes.
Although there are some common features between any two classes, their
co-occurrence with the other features helps the classifier to separate
them from each other. In this way, our model can represent various
object classes with a relatively small number of features. Indeed,
exploiting the intermediate complexity features, which are neither
common to all classes nor very rare, can help the classifier to
discriminate instances of different classes [iH].
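
A minimal sketch of this selection step is given below. It assumes that
feature responses are first binarised (active or not) and that mutual
information is computed between each binary feature and a
one-versus-rest class indicator, keeping only features that are more
active inside the class than outside it. The binarisation, the function
names, and the toy data are assumptions made for illustration.

```python
from math import log2


def mutual_information(xs, ys):
    """Mutual information in bits between two equal-length binary sequences."""
    n = len(xs)
    mi = 0.0
    for u in (0, 1):
        for v in (0, 1):
            p_uv = sum(1 for x, y in zip(xs, ys) if x == u and y == v) / n
            p_u = sum(1 for x in xs if x == u) / n
            p_v = sum(1 for y in ys if y == v) / n
            if p_uv > 0:
                mi += p_uv * log2(p_uv / (p_u * p_v))
    return mi


def top_features(active, labels, target, k):
    """Indices of up to k features most informative for `target` vs. the rest."""
    ys = [1 if lab == target else 0 for lab in labels]
    n_in = sum(ys)
    n_out = len(ys) - n_in
    scored = []
    for f in range(len(active[0])):
        xs = [row[f] for row in active]
        rate_in = sum(x for x, y in zip(xs, ys) if y) / n_in
        rate_out = sum(x for x, y in zip(xs, ys) if not y) / n_out
        if rate_in > rate_out:  # keep features that favour the target class
            scored.append((mutual_information(xs, ys), f))
    scored.sort(key=lambda t: (-t[0], t[1]))
    return {f for _, f in scored[:k]}


def common_count(active, labels, class_a, class_b, k):
    """Number of features shared by the top-k sets of two classes."""
    return len(top_features(active, labels, class_a, k)
               & top_features(active, labels, class_b, k))


# Toy data: 6 samples, 4 binary features; feature 3 is shared by "a" and "b".
labels = ["a", "a", "b", "b", "c", "c"]
active = [[1, 0, 0, 1], [1, 0, 0, 1],
          [0, 1, 0, 1], [0, 1, 0, 1],
          [0, 0, 1, 0], [0, 0, 1, 0]]
shared_ab = common_count(active, labels, "a", "b", 2)
shared_ac = common_count(active, labels, "a", "c", 2)
```

In the setting of the text, `k` would be 50 and each entry of the
pairwise table would be one such `common_count` value.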

3.4 Random features and simple classifier

In a previous study [Hj, it has been shown that using the HMAX model
with random dot patterns in the S2 layer can reach a reasonable
performance, comparable to the one obtained with random patches cropped
from the training images. It seems that this is due to the dependence of
HMAX on a powerful classifier. Indeed, the use of either random dot
patterns or randomly selected patches transforms the images into a
complex and nested feature space, and it is the classifier which looks
for a complex signature to separate object classes. The deficiencies
emerge when the classification problem gets harder (such as invariant or
multiclass object recognition problems), and then even a powerful
classifier is not able to discriminate the classes [2HI EH]. Here, we
show that the superiority of our model is due to the informative feature
extraction through a bio-inspired learning rule. To this end, we have
compared the performances on the 3D-Object dataset obtained with random
features versus STDP features, as well as with a very simple classifier
versus SVM.

To generate random features, we set the weight matrix of each S2 feature
of our model to random values. First, we computed the mean and standard
deviation (STD) (253 ± 21) of the number of active (nonzero) weights in
the features learned by STDP. Second, for each random feature, the
number of active weights, N, is computed by generating a random number
based on the obtained mean and STD. Finally, a random feature is
constructed by uniformly distributing the N randomly generated values in
the weight matrix.
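
The three steps above can be sketched as follows. The text does not
specify the distribution from which N is drawn, the range of the weight
values, or the dimensions of the weight matrix; the sketch assumes a
normal distribution for N, weights uniform in (0, 1], and a hypothetical
4 x 17 x 17 matrix. All names are illustrative.

```python
import random


def random_feature(shape, mean_active=253, std_active=21, rng=None):
    """Build one random feature as a flat list of weights.

    shape       -- (orientations, height, width); hypothetical dimensions
    mean_active -- mean number of nonzero weights in STDP-learned features
    std_active  -- standard deviation of that number
    """
    rng = rng or random.Random()
    size = shape[0] * shape[1] * shape[2]
    # Draw N from a normal distribution (assumed) and clamp to a valid range.
    n_active = int(round(rng.gauss(mean_active, std_active)))
    n_active = max(0, min(size, n_active))
    weights = [0.0] * size
    # Scatter N random values over uniformly chosen distinct positions.
    for pos in rng.sample(range(size), n_active):
        weights[pos] = 1.0 - rng.random()  # in (0, 1], so it stays nonzero
    return weights


rng = random.Random(0)
features = [random_feature((4, 17, 17), rng=rng) for _ in range(500)]
counts = [sum(1 for w in f if w != 0.0) for f in features]
print(sum(counts) / len(counts))
```

By construction, the average number of active weights over the 500
random features stays close to the mean measured on the learned
features.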

In addition, we designed a simple classifier composed of several
one-versus-one classifiers. For each binary classifier, two subsets of
C2 features with high occurrence probabilities in one of the two classes
are selected. In more detail, to select suitable features for the first
class, the occurrence probabilities of C2 features in this class are
divided by the corresponding occurrence probabilities in the second
class. Then, a feature is selected if this ratio is higher than a
threshold. The optimum threshold value is found by a trial-and-error
search in which the performance over the training samples is maximized.
To assign a class label to the input test sample, we computed the inner
product of the feature value and feature probability vectors. Finally,
the class with the highest probability is reported to the combined
classifier. The combined classifier selects the winner class based on
simple majority voting.
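
One possible reading of this classifier is sketched below. Several
details are not given in the text and are assumptions here: a feature
"occurs" when its value is above zero, a small constant avoids division
by zero in the ratio, ties go to the first class of a pair, and the
threshold search is a scan over a few hand-picked candidates. The
function names and toy data are illustrative.

```python
from collections import Counter
from itertools import combinations

EPS = 1e-6  # avoids division by zero in the probability ratio (assumption)


def occurrence_probs(samples):
    """Fraction of samples in which each feature is active (value > 0)."""
    n = len(samples)
    return [sum(1 for s in samples if s[f] > 0) / n
            for f in range(len(samples[0]))]


def train(data, threshold):
    """data maps class label -> list of C2 vectors. Returns probs and pair models."""
    probs = {c: occurrence_probs(v) for c, v in data.items()}
    models = {}
    for a, b in combinations(sorted(data), 2):
        pairs = list(enumerate(zip(probs[a], probs[b])))
        sel_a = [f for f, (pa, pb) in pairs if pa / (pb + EPS) > threshold]
        sel_b = [f for f, (pa, pb) in pairs if pb / (pa + EPS) > threshold]
        models[(a, b)] = (sel_a, sel_b)
    return probs, models


def predict(x, probs, models):
    """Majority vote over all one-versus-one decisions."""
    votes = Counter()
    for (a, b), (sel_a, sel_b) in models.items():
        # Inner product of feature values and occurrence probabilities.
        score_a = sum(x[f] * probs[a][f] for f in sel_a)
        score_b = sum(x[f] * probs[b][f] for f in sel_b)
        votes[a if score_a >= score_b else b] += 1
    return votes.most_common(1)[0][0]


def fit(data, candidates=(1.5, 2.0, 4.0)):
    """Pick the candidate threshold with the best training accuracy."""
    def hits(thr):
        probs, models = train(data, thr)
        return sum(predict(x, probs, models) == c
                   for c, xs in data.items() for x in xs)
    return train(data, max(candidates, key=hits))


# Toy C2 vectors with three features and three classes.
data = {
    "car": [[0.9, 0.0, 0.1], [0.8, 0.1, 0.0]],
    "shoe": [[0.0, 0.7, 0.0], [0.1, 0.9, 0.0]],
    "head": [[0.0, 0.0, 0.8], [0.0, 0.1, 0.9]],
}
probs, models = fit(data)
label = predict([0.85, 0.05, 0.0], probs, models)
```

The classifier has no learned weights beyond the occurrence
probabilities, so its accuracy depends almost entirely on how class
specific the features are.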

For 500 random features, using the SVM and the simple classifier, our
model reached classification performances of 71% and 21% on average,
respectively. In contrast, for the learned S2 features, the SVM and the
simple classifier attained reasonable performances of 96% and 79%,
respectively. Based on these results, it can be concluded that the
features obtained through bio-inspired unsupervised learning project the
objects into an easily separable space, while feature extraction by
selection of random patches (drawn from the training images) or by
generation of random patterns leads to a complex object representation.

Table 2: The number of common features between each pair of classes of the 3D-Object dataset. The gray level of each cell indicates the relative distance of the cell value to the maximum possible value (=50).

Class      Bicycle  Car  Cellphone  Head  Iron  Monitor  Mouse  Shoe  Stapler  Toaster
Bicycle         50    0          1     3     6        4     12     6        5        5
Car              0   50          1    14    10       11      1     2        6        7
Cellphone        1    1         50     0     3        4      2     0       10        9
Head             3   14          0    50     0        0     10    16        2        2
Iron             6   10          3     0    50       21      0     4       12        6
Monitor          4   11          4     0    21       50      0     0        8       14
Mouse           12    1          2    10     0        0     50     2        2        5
Shoe             6    2          0    16     4        0      2    50        0        0
Stapler          5    6         10     2    12        8      2     0       50       20
Toaster          5    7          9     2     6       14      5     0       20       50
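
The average overlap reported for Table 2 can be re-derived directly from
the table entries. The short check below simply transcribes the table as
it appears in this text and averages the off-diagonal entries over the
45 class pairs; the variable names are ours.

```python
from itertools import combinations

classes = ["Bicycle", "Car", "Cellphone", "Head", "Iron",
           "Monitor", "Mouse", "Shoe", "Stapler", "Toaster"]
table = [
    [50, 0, 1, 3, 6, 4, 12, 6, 5, 5],
    [0, 50, 1, 14, 10, 11, 1, 2, 6, 7],
    [1, 1, 50, 0, 3, 4, 2, 0, 10, 9],
    [3, 14, 0, 50, 0, 0, 10, 16, 2, 2],
    [6, 10, 3, 0, 50, 21, 0, 4, 12, 6],
    [4, 11, 4, 0, 21, 50, 0, 0, 8, 14],
    [12, 1, 2, 10, 0, 0, 50, 2, 2, 5],
    [6, 2, 0, 16, 4, 0, 2, 50, 0, 0],
    [5, 6, 10, 2, 12, 8, 2, 0, 50, 20],
    [5, 7, 9, 2, 6, 14, 5, 0, 20, 50],
]

# Each unordered pair of classes is counted once (upper triangle).
pairs = list(combinations(range(len(classes)), 2))
total = sum(table[i][j] for i, j in pairs)
average = total / len(pairs)
print(f"{len(pairs)} pairs, {total} shared features, average {average:.2f}")
# prints "45 pairs, 246 shared features, average 5.47"
```

For the values as transcribed here the mean is about 5.47 out of a
possible 50, close to the figure quoted in the text, so on average
roughly one informative feature in nine is shared between two classes.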


4 Discussion 

Position and scale invariance in our model are built in, thanks to
weight sharing and the scaling process. Conversely, view-invariance must
be obtained through the learning process. Here, we used all images of
five object instances from each category (varied in all dimensions) to
learn the S2 visual features, while images of all other object instances
of each category were used to test the network. Hence, the model was
exposed to all possible variations during learning to gain
view-invariance. Moreover, near or opposite views of the same object
share some features which are suitable for invariant object recognition.
For instance, consider the overall shape of a head, or close views of a
bike wheel, which could be a complete circle or an ellipse. Given that
STDP tends to learn the more frequent features in different images,
different views of an object could be invariantly represented based on
their more common features.


Our model appears to be the best choice when dealing with few object
classes but huge variations in viewpoint. As pointed out in previous
studies, both HMAX and DeepConvNet models could not handle these
variations perfectly |2H1 EHl 1^. Conversely, our model is not
appropriate for handling many classes, which requires thousands of
features, as in the ImageNet contest, because its time complexity is
roughly in O(N^2), where N is the number of features (briefly: since the
number of firing neurons per image is limited, if the number of features
is doubled, reaching convergence will take roughly twice as many images,
and the processing time for each of them will be doubled as well). For
example, extracting 4096 features in our model, the same number of
features as in DeepConvNet, would take about 67 times as long as it took
us to extract 500. However, a parallel implementation of our algorithm
could speed up the computation by several orders of magnitude [13]. Even
in this case, we do not expect to outperform the DeepConvNet model on
the ImageNet database, since only shape similarities are taken into
account in our model and other cues such as color or texture are
ignored.
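
The factor of about 67 follows from the quadratic scaling argument given
above: both the number of images needed for convergence and the
processing time per image grow linearly with the number of features. The
small helper below, whose name is ours, makes the arithmetic explicit.

```python
def relative_cost(n_features, baseline=500):
    """Cost relative to the baseline under the quadratic model from the text."""
    images_factor = n_features / baseline     # more images needed to converge
    per_image_factor = n_features / baseline  # more work per image
    return images_factor * per_image_factor


ratio = relative_cost(4096)
print(round(ratio, 1))  # prints 67.1
```

Doubling the number of features from 500 to 1000 gives a factor of 4
under the same model.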


Importantly, our algorithm has a natural tendency to learn salient
contrasted regions m. which is desirable as these are typically the most
informative ^Hj. Most of our C2 features turned out to be
class-specific, and we could guess what they represent by doing the
reconstructions (see Fig. S2 and Fig. S2). Since each feature results
from averaging multiple input images, the specificity of each instance
is averaged out, leading to class archetypes. Consequently, good
classification results can be obtained using only a few features, or
even using ‘simple’ decision rules like feature counts j2lj and majority
voting (here), as opposed to a ‘smart classifier’ such as SVM.

There are some similarities between STDP-based feature learning and
non-negative matrix factorization HU, as first intuited in [18], and
later demonstrated mathematically in [IHj. Within both approaches,
objects are represented as (positive) sums of their parts, and the parts
are learned by detecting consistently co-active input units.

Our model could be efficiently implemented in hardware, for example
using address event representation (AER) [nni EH eh esi]. With AER, the
spikes are carried as addresses of sending or receiving neurons on a
digital bus. Time ‘represents itself’ as the asynchronous occurrence of
the event [Slj. Thus the use of STDP will lead to a system which
effectively becomes more and more reactive, in addition to becoming more
and more selective. Furthermore, since biological hardware is known to
be incredibly slow, simulations could run several orders of magnitude
faster than real time [nS]. As mentioned earlier, the primate visual
system extracts the rough content of an image in about 100 ms. We thus
speculate that some dedicated hardware will be able to do the same in
the order of a millisecond or less.

Recent computational [iO], psychophysical [56], and fMRI [57]
experiments demonstrate that informative intermediate complexity
features are optimal for object categorization tasks. But the possible
neural mechanisms to extract such features remain largely unknown. The
HMAX model ignores these learning mechanisms and imprints its features
with random crops from the training images dEHl, or even uses random
filters mi ESI. Most individual features are thus not very informative,
yet in some cases, a ‘smart’ classifier such as SVM can efficiently
separate the high-dimensional vectors of population responses.

Many other models use supervised learning rules da EH], sometimes
reaching impressive performance on natural image classification tasks
[T6|. The main drawback of these supervised methods, however, is that
learning is slow and requires numerous labeled samples (e.g., about 1
million in dsi), because of the credit assignment problem [601 EH. This
contrasts with humans, who can generalize efficiently from just a few
training examples [43]. We avoid the credit assignment problem by
keeping the C2 features fixed when training the final classifier (that
being said, fine-tuning them for a given classification problem would
probably increase the performance of our model [13162]; we will test
this in future studies). Even if the efficiency of such hybrid
unsupervised-supervised learning schemes has been known for a long time,
few alternative unsupervised learning algorithms have been shown to be
able to extract complex and high-level visual features (see dsiini).
Finding better representational learning algorithms is thus an important
direction for future research, and seeking inspiration in biological
visual systems is likely to be fruitful [13]. We suggest here that the
physiological mechanism known as STDP is an appealing starting point.

Considering the time relation among the incoming inputs is an important
aspect of spiking neural networks. This property is critical for
extending the existing models from static vision to continuous vision
[63]. A prominent example is the trace learning rule [M], suggesting
that the invariant object representation in the ventral visual system is
instructed by the implicit temporal contiguity of vision. Also, in
various motion processing and action recognition problems [65], the
important information lies in the appearance timing of input features.
Our model has the potential to be extended to continuous and dynamic
vision, something that we will explore further.

5 Conclusions 

To date, various bio-inspired network architectures for object
recognition have been introduced, but the learning mechanism of
biological visual systems has been neglected. In this paper, we
demonstrate that the association of both a bio-inspired network
architecture and a bio-inspired learning rule results in a robust object
recognition system. The STDP-based feature learning used in our model
extracts frequent, diagnostic, and class-specific features that are
robust to deformations in stimulus appearance. It has previously been
shown that trivial models cannot tolerate identity-preserving
transformations such as changes in view, scale, and position. To study
the behavior of our model when confronted with these difficulties, we
have tested it on two challenging invariant object recognition
databases, which include instances of 10 different object classes
photographed in different views, scales, and tilts. The categorization
performances indicate that our model is able to robustly recognize
objects in such severe conditions. In addition, several analytical
techniques have been employed to show that the main contribution to this
success is provided by the unsupervised STDP feature learning, not by
the classifier. Using the representational dissimilarity matrix, we have
shown that the representations of input images in the C2 layer are more
similar for within-category objects and more dissimilar for
between-category objects. In this way, as confirmed by the hierarchical
clustering, objects of the same category are represented in neighboring
regions of the C2 feature space. Hence, even when using a simple
classifier, our model is able to reach an acceptable performance, while
the random features fail.

Supplementary Information

Here we provide the results of feature analysis techniques such as RDM and hierarchical clustering on
the ETH-80 dataset for both HMAX and our model. Some sample images of the ETH-80 dataset are shown in
Fig. S1. In Fig. S3 and Fig. S4, the RDMs of C2 features of our model and HMAX in eight view angles
are presented, respectively. It can be seen that our model can better represent classes with high shape
similarities, such as tomato, apple, and pear, or cow, horse, and dog, than the HMAX model.
Also, the hierarchical clusterings of the whole training data based on their representations in the feature
spaces of our model and HMAX are shown in Fig. S5 and Fig. S6, respectively. As for the 3D-Object
dataset, HMAX feature extraction leads to a nested representation of different object classes, which
causes poor classification accuracy. Here again, a huge number of images which belong to different
classes are assigned to a large cluster with internal dissimilarities lower than 0.14. On the other hand,
our model has distributed images of different classes in different regions of the C2 feature space. Note that
the largest cluster of our model includes the instances of the tomato, apple, and pear classes, whose
shapes are very similar.




Figure S1: Some images of the car class of the ETH-80 dataset in different A) views, and B) tilts.

Figure S2: Three S2 feature prototypes selective to the a) horse, b) pear, and c) cup classes of the ETH-80 dataset along with their reconstructed preferred stimuli.




(e) View 180°  (f) View 225°  (g) View 270°  (h) View 315°

Figure S3: RDMs for C2 features of our model on ETH-80 corresponding to different viewpoints.

(g) View 270°  (h) View 315°

Figure S4: RDMs of HMAX C2 features on ETH-80 corresponding to different viewpoints.



Figure S5: The hierarchy of clusters and their labels for our model on the ETH-80 dataset. The label of each cluster indicates the class with the highest frequency in that cluster. The cardinality of each cluster, C, and the cardinality of the class with the highest frequency, H, are placed below the cluster label as H/C.

Figure S6: The hierarchy of clusters and their labels for the HMAX model on the ETH-80 dataset. The label of each cluster indicates the class with the highest frequency in that cluster. The cardinality of each cluster, C, and the cardinality of the class with the highest frequency, H, are placed below the cluster label as H/C.